Abstract. In this paper, we develop a dynamic framework for the
modeling and analysis of social networks extracted from web documents.
We illustrate the model with features of the web, design a form for
analyzing relationships among attributes as a modality of social
structure, and formulate the optimization of a generative model based
on Bayes' theorem.
Keywords: conditional probability, graph, similarity, singleton,
doubleton, framework.



1 Introduction 

A social network describes a group of social entities and the pattern
of inter-relationships among them. The concept of a social network is
designed to map the relationships among all entities that can be
observed, to mark the patterns of ties between entities, to measure
social capital (the values obtained by the entities individually or in
groups), and to present a variety of social structures according to
the interests and their implementation, based on different domains or
information sources [1]. Group discovery has many applications, such
as understanding the social structure of organizations or native
tribes. In law enforcement, where organized crime such as drugs and
money laundering [2] or terrorism [3] is concerned, knowing how the
perpetrators are connected to one another would assist the effort to
disrupt a criminal act or to identify additional suspects. In
commerce, viral marketing exploits the relationship between existing
customers and potential customers to increase sales of products and
services [4,5]. Members of a social network may also take advantage of
their connections to get to know others, for instance through web
sites facilitating networking or dating among their users [6].



Social networks explicitly exhibit relationships (called ties in the
social sciences) among individuals and groups (called actors). They
have been studied in the social sciences since the 1930s. Social
scientists have conducted extensive research on group detection,
especially in fields such as anthropology and political science.
Recently, statisticians and computer scientists have begun to develop
models that specifically discover group memberships [7,8,9]. Two kinds
of models that use probability to characterize information sources,
including images, have become tools in machine learning research:
probabilistic generative models and probabilistic relative models. The
approaches that address the issues of social networks therefore
generally fall into these two categories as well, and this paper
explores their relationship in order to obtain a framework for
investigating social networks by engaging web features.

2 Related Work and Problem 

Technically, the object called a network is a graph. A graph consists
of a set of points along with a set of lines connecting pairs of
points. Formally, a graph is denoted by $G(V,E)$, where
$V \neq \emptyset$ is a set of vertices or nodes,
$V = \{v_i \mid i = 1,\dots,I\}$, and $E$ is a set of edges connecting
pairs of vertices, $E = \{e_j \mid j = 1,\dots,J\}$. The nodes in a
social network refer to actor names such as authors, recipients,
researchers, artists, politicians, firms, organizations, or any
entity, i.e.,

Definition 1. Let $A = \{a_i \mid i = 1,\dots,I\}$ be a set of actors.
There is a function $\xi$ such that $\xi : A \to V$, or
$\forall a \in A\ \exists! v \in V$.

Each actor plays some role in social interactions based on his or her
background and on the achievements obtained on each occasion, such as
new articles and academic publications. In many situations, these are
considered attributes or characteristics, and a characteristic can be
defined as follows.

Definition 2. Let $Z = \{z_k \mid k = 1,\dots,K\}$ be a set of
attributes, and let the pair $(A, Z)$ be the instance of actors, where
each $Z_i \subseteq Z$ is the subset of attributes of actor $a_i$,
i.e., $(a_i, Z_i)$, $i = 1,\dots,I$. We simply denote the set of
attributes of actor $a$ by $Z_a$.



A social network is a network based on the relations between people in
their society. Therefore, we can model a social network by analogy
with other kinds of networks. When a computer network connects people,
it is a social network [10]. Just as a computer network is a set of
machines connected by a set of media (cables or airwaves), a social
network is a set of people connected by a set of social relationships,
such as friendship, co-working, or information exchange, for example
through the Web. Information about people in web documents is very
different from information about people in a database. In any
document, the objects (actors and attributes) can be given literally,
like the literal text of Indonesia, and the meaning of each object is
then based on the words of the literal object itself. To realize this,
we first define a word $w$ as the basic unit of discrete data, an item
from a vocabulary indexed by $\{1,\dots,K\}$, where $w_k = 1$ if
$k \in K$, and $w_k = 0$ otherwise. Then, we define some instances
related to words.

Definition 3. A document is a sequence of $N$ words denoted by
$D = \{w_1, \dots, w_N\}$, where $w_n$ is the $n$-th word in the
sequence. The size of a document is the cardinality of $D$, i.e.,
$|D| = N$. $\mathcal{D} = \{D_1, \dots, D_M\}$ is a collection of $M$
documents, called a corpus.

Definition 4. A term $t_k$ consists of at least one word, i.e.,
$t_k = (w_1, \dots, w_k)$, $l \leq k$, where $k$ is the number of
parameters representing words. $|t_k| = k$ is the number of words in
$t_k$, and $l$ is the number of vocabulary items in $t_k$.

Any information available on the web may be obtained with the help of
search engines. The search engine is an important part of the internet
and is one of the easiest and most useful tools for researching
information or finding websites. A search engine allows users to
categorize and make sense of the information that is available online.
The search results are generally presented as a list and are called
hits, where the information may consist of text, images, video,
hypertext, etc.

Definition 5. Let the set of web pages indexed by a search engine be
$\Omega$, i.e., a set containing ordered pairs of a term $t_{k_i}$ and
a web page $\omega_{k_j}$, $(t_{k_i}, \omega_{k_j})$, $i = 1,\dots,I$,
$j = 1,\dots,J$. The relation table consists of two columns, $t_k$ and
$\omega_k$, and is a representation of $(t_{k_i}, \omega_{k_j})$,
where $\Omega_k = \{(t_k, \omega_k)_{ij}\} \subseteq \Omega$ or
$\Omega_k = \{\omega_{k_1}, \dots, \omega_{k_J}\}$. The cardinality of
$\Omega$ is denoted by $|\Omega|$, and the uniform probability mass
function is $P : \Omega \to [0, 1]$.



Definition 6. Let $t_x$ be a search term with $t_x \in S$, where $S$
is the set of singleton search terms of a search engine. A vector
space $\Omega_x \subseteq \Omega$ is a singleton search engine event
of web pages that contain an occurrence of $t_x \in \omega_x$, and the
probability of the event $\Omega_x$ is
$P(\Omega_x) = |\Omega_x|/|\Omega| \in [0, 1]$.

Proposition 1. Let $\Omega_x$ and $\Omega_y$ be two singleton events
for the search terms $t_x$ and $t_y$, respectively. Then
$\Omega_x \cap \Omega_y$ is a doubleton event of $t_x$ and $t_y$ such
that $P(\Omega_x, \Omega_y) = |\Omega_x \cap \Omega_y|/|\Omega|$.

Proof. Applying the set intersection operator to Definition 6 gives
the conclusion directly, i.e., $P(\Omega_x) = |\Omega_x|/|\Omega|$ and
$P(\Omega_y) = |\Omega_y|/|\Omega|$ imply
$P(\Omega_x, \Omega_y) = |\Omega_x \cap \Omega_y|/|\Omega|$.

A search for a particular index term, say $t_x$, returns a certain
number of hits $n_x$, i.e., the number of web pages in which this term
occurs, so we obtain $p(t_x) = n_x/|\Omega|$. For the pair
$(t_x, t_y)$ we have a doubleton count $n_{xy}$, with
$n_x \geq n_{xy}$. The ratio $n_{xy}/|\Omega|$ is the joint
probability, from which the conditional probabilities
$p(t_x \mid t_y) = n_{xy}/n_y$ and $p(t_y \mid t_x) = n_{xy}/n_x$
follow. A doubleton therefore determines the conditional probability
of one term given the other.

We note that in the conditional probabilities the total number of web
pages indexed by the search engine, $|\Omega|$, is divided out.
Therefore, the conditional probabilities are independent of
$|\Omega|$. The conditional probability $P(x \mid y) > 0$ means that
the search terms $t_x$ and $t_y$ occur together in some web pages
(co-occurrence), but parts of $t_x$ or $t_y$ may also occur together,
so that $x$ or $y$ is biased.
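
As a minimal sketch of the quantities discussed above, the following
Python fragment computes singleton and doubleton probabilities from hit
counts and derives conditional probabilities with the standard
definition P(x | y) = P(x, y) / P(y), under which the index size
cancels. The hit counts, the index size, and the function names are
invented for illustration; they are not real search engine results.

```python
def singleton_probability(n_x: int, omega_size: int) -> float:
    """P(t_x) = n_x / |Omega| for a singleton hit count n_x."""
    return n_x / omega_size


def doubleton_probability(n_xy: int, omega_size: int) -> float:
    """P(t_x, t_y) = n_xy / |Omega| for a doubleton hit count n_xy."""
    return n_xy / omega_size


def conditional_probability(n_xy: int, n_y: int) -> float:
    """P(t_x | t_y) = (n_xy / |Omega|) / (n_y / |Omega|) = n_xy / n_y."""
    if n_y == 0:
        raise ValueError("the conditioning term has no hits")
    return n_xy / n_y


# Hypothetical values (assumptions, not measurements).
OMEGA_SIZE = 1_000_000            # |Omega|: number of indexed pages
N_X, N_Y, N_XY = 5_000, 2_000, 500

p_x = singleton_probability(N_X, OMEGA_SIZE)
p_y = singleton_probability(N_Y, OMEGA_SIZE)
p_xy = doubleton_probability(N_XY, OMEGA_SIZE)
p_x_given_y = conditional_probability(N_XY, N_Y)
p_y_given_x = conditional_probability(N_XY, N_X)
```

Because the conditional probabilities depend only on the hit counts,
they can be computed even when the true index size is unknown.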

The edges in a social network refer to ties. A tie relates two actors.
Ties can be directed or undirected, and they can be dichotomous
(present or absent) or valued (weighted). There may be many types of
ties (e.g., kinship, friendship), and the collection of all ties of
the same type is a relation. Relations, sometimes called strands, are
characterized by content, direction and strength. Let $R$ be a set of
relations. The relations among actors are formed by sharing
attributes, ideas, concepts, etc., between them, which can be depicted
as the intersection between their attributes [11] as follows

$$r_k(a, b) = Z_a \cap Z_b, \quad r_k \in R. \tag{1}$$
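
A relation of this kind can be illustrated with plain set
intersection. The sketch below is only an example: the actor keys and
attribute values are hypothetical, and the function name `relation` is
not taken from the paper.

```python
# Hypothetical actors and their attribute sets Z_a (illustrative only).
attributes = {
    "a": {"data mining", "social network", "semantic web"},
    "b": {"social network", "semantic web", "graph theory"},
    "c": {"numerical analysis"},
}


def relation(attrs: dict[str, set[str]], a: str, b: str) -> set[str]:
    """r(a, b) = Z_a & Z_b: the attributes shared by actors a and b."""
    return attrs[a] & attrs[b]


shared_ab = relation(attributes, "a", "b")  # non-empty: a tie exists
shared_ac = relation(attributes, "a", "c")  # empty: no shared attribute
```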

The content of a relation refers to the resource that is exchanged,
such as communication about administrative, personal, work-related or
social matters. Communication, defined generally as the transfer of
information or resources, is common among socially related people, and
the electronic trails of communication that can be traced include
emails [12], newsgroups [32], and instant messaging [13]. In the
content we sometimes find self-reports, i.e., links reported by
individual actors. Such links are directed and naturally subjective.
Classical tools like questionnaires and interviews are based on this
principle [T3], and homepages or profile pages on community-centric
sites such as LiveJournal weblogs [H] or Facebook [6] commonly display
a self-professed list of friends within the community. Therefore, a
relation can be directed or undirected: when one person gives social
support to a second person, there are two relations, giving support
and receiving support. Relations also differ in strength, which can be
operationalized in a number of ways [TTfTE]. The types of relations
important in social network research have included the exchange of
complex or difficult information [19], emotional support [19,20,21],
uncertain or equivocal communication [22,23], and communication to
generate ideas, create consensus [23,25,26,27,28,29,30], support work,
foster sociable relations [31,32], or support virtual community [33].
In general, the strength of a relation is generated by similarity
measures, among them Dice, overlap, and Jaccard [34,35,36], as follows

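$$\mathrm{sim}(x, y) = \frac{2\,n_{x \wedge y}}{n_x + n_y}, \tag{2}$$

$$\mathrm{sim}(x, y) = \frac{n_{x \wedge y}}{\min(n_x, n_y)}, \tag{3}$$

and

$$\mathrm{sim}(x, y) = \frac{n_{x \wedge y}}{n_x + n_y - n_{x \wedge y}}. \tag{4}$$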
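
The three measures can be sketched directly from singleton and
doubleton hit counts. The code below uses the standard definitions of
the Dice, overlap, and Jaccard coefficients; the hit counts are
hypothetical, and returning 0.0 for empty denominators is a choice made
here, not something stated in the text.

```python
def dice(n_x: int, n_y: int, n_xy: int) -> float:
    """Dice coefficient: 2 * n_xy / (n_x + n_y)."""
    total = n_x + n_y
    return 2 * n_xy / total if total else 0.0


def overlap(n_x: int, n_y: int, n_xy: int) -> float:
    """Overlap coefficient: n_xy / min(n_x, n_y)."""
    smaller = min(n_x, n_y)
    return n_xy / smaller if smaller else 0.0


def jaccard(n_x: int, n_y: int, n_xy: int) -> float:
    """Jaccard coefficient: n_xy / (n_x + n_y - n_xy)."""
    union = n_x + n_y - n_xy
    return n_xy / union if union else 0.0


# Hypothetical hit counts for two actor names and their co-occurrence.
n_x, n_y, n_xy = 5_000, 2_000, 500

sim_dice = dice(n_x, n_y, n_xy)        # 1000 / 7000
sim_overlap = overlap(n_x, n_y, n_xy)  # 500 / 2000
sim_jaccard = jaccard(n_x, n_y, n_xy)  # 500 / 6500
```

For the same counts the three measures give different strengths, which
is one way to see why the choice of measure influences the resulting
network.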
Similarity has its foundation in the sociological idea that friends
tend to be alike [37]. Other forms of similarity include having the
same communication partners [12] and sharing the same opinions or
areas of interest [3]. Each similarity measure is relative to the
others, and also to the probability; we therefore call these measures
probabilistic relative models. We generally define such a model as
follows.



Definition 7. A probabilistic relative model (PRM) utilizes the
Cartesian product for clustering the nodes in $A$, i.e.,

$$\gamma : A \times A \to R$$

such that $\gamma(a, b) \in R$, $a, b \in A$.

However, this model not only suffers from bias in the measurement but
also makes it difficult to produce descriptions of the relationship.
In the real world and its applications, a social network requires
labels with a weight as a modality, as well as an explanation of the
role of each actor in the society. We define an approach to generating
labels as follows.

Definition 8. A probabilistic generative model (PGM) employs a
function $\lambda$ for classifying $Z$, i.e.,

$$\lambda : Z \to C$$

such that $\lambda(z) = c$, $z \in Z$, and $c \in C$ is a class of
labels, where $C = \{c_1, c_2, \dots, c_{|C|}\}$ is a data set of
special target attributes, $|C| \geq 2$ is the number of classes, and
$Z \cap C = \emptyset$.

Statistical natural language processing is an analysis that captures
the richness of the language content of the interactions: the words,
the topics, and other high-dimensional specifics of the interactions
between actors. Statistically, Bayes' theorem has paid dividends in
the computing world, especially in artificial intelligence and machine
learning, where the PGM in particular has played a role. Some PGMs,
such as the Author-Recipient-Topic (ART) model, are direct offspring
of Latent Dirichlet Allocation (LDA) [38], the Multi-label Mixture
Model [39], and the Author-Topic Model [2TK0], with the distinction
that ART is specifically designed to capture language used in a
directed network of correspondents. However, PGMs are concerned with
the extraction of networks based on predefined labels only, and thus
cannot be adapted to other descriptions of relations.

3 The Concept of Probability Use 

The Bayes approach defines the classification problem in terms of
probabilities. Three main concepts are required: conditional
probability, Bayes' theorem, and the Bayes decision rule. The
conditional probability $P(a \mid D)$, which is used to define the
events of an actor relative to a document in a corpus, is defined by
$P(a \mid D) = P(a \cap D)/P(D)$, where $P(a \mid D)$ is the
probability that the event $a \in A$ happens, given that $D$ is
observed. Similarly, $P(D \mid a) = P(a \cap D)/P(a)$, where
$P(D \mid a)$ is the probability that the event $D$ happens, given
that $a \in A$ is observed. It then follows (by substitution) that
$P(a \cap D) = P(a)P(D \mid a)$. The premise of Bayes' theorem starts
with an initial degree of belief that an event will occur, which is
then updated with new information into a revised degree of belief.
These two degrees are represented, respectively, by the prior
probability $P(a)$ and the posterior probability $P(a \mid D)$, which
are related by

$$P(a \mid D) = \frac{P(a)\,P(D \mid a)}{P(D)} \tag{5}$$

The Bayes decision rule states that, based on the posterior
probabilities, it is possible to assign an element $w$ to the class
with the largest probability. For example, let $w$ be a data sample
(a vector of features) and $z_i$ one of the possible classes; then
$P(z_i)$ is the prior probability, because it can be obtained from
prior knowledge, and $P(w \mid z_i)$ is the class-conditional
probability of the sample.
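
A small numeric illustration of Bayes' theorem and the decision rule
may help. The sketch assumes that the candidate actors are mutually
exclusive and exhaustive, so that P(D) is the sum of P(a) P(D | a) over
all actors, and it reads P(a) as the prior and P(a | D) as the
posterior. All probabilities and names below are invented for the
example.

```python
def posterior(prior: dict[str, float],
              likelihood: dict[str, float]) -> dict[str, float]:
    """Return P(a | D) for every a, given P(a) and P(D | a)."""
    joint = {a: prior[a] * likelihood[a] for a in prior}  # P(a) P(D | a)
    evidence = sum(joint.values())                        # P(D)
    if evidence == 0:
        raise ValueError("P(D) is zero; the posterior is undefined")
    return {a: p / evidence for a, p in joint.items()}


def decide(post: dict[str, float]) -> str:
    """Bayes decision rule: pick the class with the largest posterior."""
    return max(post, key=post.get)


# Hypothetical priors P(a) and likelihoods P(D | a) for two actors.
prior = {"a1": 0.6, "a2": 0.4}
likelihood = {"a1": 0.1, "a2": 0.5}

post = posterior(prior, likelihood)
best = decide(post)
```

Here the actor with the smaller prior still wins, because the document
is much more likely under that actor.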

Proposition 2. Let $P : D \to [0, 1]$ be a probability mass function
whereby the probability of $w_i$ is $p(w_i) = 1/|D| \in [0, 1]$. If
$w_{v_i}$ are the vocabulary items in the document, then
$p(w_{v_i}) = |w_{v_i}|/|D|$, where $|w_{v_i}|$ is the number of
occurrences of $w_i$ in $D$.
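
The proposition amounts to relative word frequencies within one
document. A minimal sketch, assuming whitespace tokenization and a
made-up example sentence:

```python
from collections import Counter


def word_probabilities(document: list[str]) -> dict[str, float]:
    """p(w_v) = |w_v| / |D| for every distinct word w_v in document D."""
    size = len(document)
    if size == 0:
        return {}
    counts = Counter(document)
    return {word: count / size for word, count in counts.items()}


# Hypothetical document; splitting on whitespace is an assumption.
doc = "social network analysis of a social network".split()
probs = word_probabilities(doc)
```

The values sum to one over the vocabulary of the document, so they can
serve directly as word weights.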

When working with web documents in general, we explore the features of
documents in a corpus. Thus the probability $P(w)$ can be stated as
the conditional probability $P(w \mid D)$ of the event $w$ given the
document $D$ in the corpus $\mathcal{D}$. The events of $w$ in the
corpus are as follows.

Lemma 1. Let the probability of $w_{v_i}$ in $D$ be
$p(w_{v_i}) = |w_{v_i}|/|D|$; then the probability of $w_{v_i}$ in
$\mathcal{D}$ is

M I I M -i 
The set of $p(w_{v_i})$, $V = \{p(w_{v_i}) \mid i = 1, \dots, n\}$, is
a vector space in which $p(w_{v_i})$ is the weight of $w_i$.



Proof. This is a direct consequence of Definition 3 and Proposition 2.

Let $t_a$ be a search term. Each search engine produces $n_a$ as the
hit count, i.e., there are $n_a$ occurrences of $t_a \in \Omega$, or
for all $t_a \in \omega_a$, $|\Omega_a| = n_a$ and
$t_a \in \Omega_a \subseteq \Omega$. Let us define
$\Omega = \sum_{j=1}^{M} N_j$, where $M$ is the number of web pages in
$\Omega$ and $N_j$ is the number of words or terms in $\omega_j$.
Assuming the set of web pages $\Omega$ as the corpus $\mathcal{D}$, we
obtain

M \t a \ 

P(t a ) = P(a) = \n a \/\n\ = £ E T7W (6) 

j=lk=l IVIIy j 

Therefore, we can define the probability of $D \in \mathcal{D}$ as
equivalent to the probability of $\omega \in \Omega$.

Proposition 3. Let $D$ be a document from the corpus $\mathcal{D}$
with probability $P(D)$; then the probability
$P(\omega) = \sum_{j=1}^{M} N_j / |\Omega|$.

Definition 9. Let $t_a$ be a search term. $S_a = \{S_i \mid i = 1,
\dots, n\}$ is a list of snippets returned by a search engine for
$t_a$, where a snippet is $S = \{w_j \mid j = 1, \dots, |S|\}$ and
$|S| \approx 50$ words.

Each word in a document or snippet can be given a meaning by assigning
it a weight in a vector space. One method is to use tf-idf, i.e.,

M K 4 | , yr 

tf ■ idf = tf(w) ■ idf{w) = (EE-)( log ^) (7) 

Here $df(w)$ is the number of documents in which the word $w$ appears.
The normalization of tf-idf is $tf \cdot idf / h$, where $h$ is the
highest tf-idf score.
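
The weighting can be sketched as follows. Since the exact form of the
term-frequency factor is not assumed here, the sketch uses a common
variant: raw counts for tf and idf(w) = log(M / df(w)), with M the
number of snippets. Taking the highest score h over the whole
collection is also an assumption, and the snippets are made up.

```python
import math
from collections import Counter


def tf_idf(corpus: list[list[str]]) -> list[dict[str, float]]:
    """Weight each word of each document by tf(w) * log(M / df(w))."""
    m = len(corpus)
    df = Counter()
    for doc in corpus:
        df.update(set(doc))  # count each word once per document
    weights = []
    for doc in corpus:
        tf = Counter(doc)
        weights.append({w: tf[w] * math.log(m / df[w]) for w in tf})
    return weights


def normalize(weights: list[dict[str, float]]) -> list[dict[str, float]]:
    """Divide every score by h, the highest score in the collection."""
    h = max((s for doc in weights for s in doc.values()), default=0.0)
    if h == 0:
        return weights
    return [{w: s / h for w, s in doc.items()} for doc in weights]


# Hypothetical snippets, already tokenized.
snippets = [
    "social network analysis".split(),
    "social network extraction from web".split(),
    "web search engine".split(),
]
scores = normalize(tf_idf(snippets))
```

Words that occur in only one snippet receive the highest normalized
weight, while words shared by several snippets are weighted down.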

Proposition 4. If $z_k$ is a latent class, then $z_k$ relates to the
document $D$.

Proof. Each class $z_k$ contains a set of words as characteristics of
the class, and each $w$ with its value is a vector space, whereby if
there are $k$ classes of words then the intersection of the classes is
the empty set, $\bigcap_{j \in K} w_j = \emptyset$. In $D$, based on
Lemma 1, $\forall w \in D\ \exists! p(w) \in [0, 1]$, so if
$\forall w \in \bigcup_{k=1}^{K} w_k$ then
$\exists! p(w) \in [0, 1]$. Thus, there is a possibility that $z_k$ is
associated with $D$.



Specifically, a document $D$ is potentially related to several topics
$Z$ with different probabilities, i.e., a set of latent variables
$Z = \{z_1, \dots, z_K\}$. Therefore, by the same reasoning we obtain
the following lemma, in which the latent variables generate a set of
words $w$.

Lemma 2. If $z_k$ is a latent class, the word $w$ can be generated
with probability $P(w \mid z_k)$.

Proposition 5. If $z_k$ is a latent class, then $z_k$ relates to the
actor $a \in A$.

Proof. Each actor uses words to communicate with other actors in the
society, and some of this communication is recorded in documents, so
those documents relate to the intended actor. When such a word is also
present in $z_k$, we obtain a relation between $a$ and $z_k$.

The latent variables can generate appearances of names in $A$ that are
closely related to a specific topic.

Lemma 3. If $z_k$ is a latent class, the actor $a$ can be generated
with probability $P(a \mid z_k)$.

The concept treats document, name, and word as a triplet $(D, a, w)$
representing an instance in which a name $a$ appears in a document
$D$ that contains the word $w$. The relationship inherent in this
concept is mediated by a set of topics $Z$, where a set of latent
variables $z$ can break the direct relationships between documents,
words, and names.

Theorem 1. Let $Z$ be a set of latent variables,
$Z = \{z_l \mid l = 1, \dots, L\}$, with size $L$. Each of them
represents a latent topic if and only if the relationships between
$D \in \mathcal{D}$, $a \in A$, and $w \in D$ are connected by $Z$.

Proof. This is a direct result of the previous lemmas.

4 The Framework

There are five steps in the Bayesian methodology [12] for
classification based on Definition 8: (i) Collect data and estimate
parameters such as the mean and covariance for each class. In this
case, assume that all the probability density functions have Gaussian
behavior. (ii) Choose a set of features. (iii) Choose a model and
derive a decision rule with these parameters. (iv) Train the
classifier by using a discriminant function as the decision rule, and
apply it to a test data set to classify each sample. (v) Evaluate the
decision rule: measure the accuracy or error rate in order to improve
the choice of features and the overall design of the classifier.
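
The five steps can be sketched with a deliberately small example: one
numeric feature per sample and one Gaussian per class. This is an
illustration under those assumptions, not the classifier used in the
framework; the class labels, training values, and function names are
hypothetical.

```python
import math
from statistics import mean, pvariance


def fit(samples: dict[str, list[float]]) -> dict[str, tuple[float, float, float]]:
    """Steps (i)-(iii): estimate prior, mean and variance per class."""
    total = sum(len(xs) for xs in samples.values())
    return {c: (len(xs) / total, mean(xs), pvariance(xs))
            for c, xs in samples.items()}


def discriminant(x: float, prior: float, mu: float, var: float) -> float:
    """log P(c) + log N(x; mu, var), the Gaussian discriminant function."""
    return (math.log(prior)
            - 0.5 * math.log(2 * math.pi * var)
            - (x - mu) ** 2 / (2 * var))


def classify(x: float, params: dict[str, tuple[float, float, float]]) -> str:
    """Step (iv): assign x to the class with the largest discriminant."""
    return max(params, key=lambda c: discriminant(x, *params[c]))


def error_rate(test: list[tuple[float, str]],
               params: dict[str, tuple[float, float, float]]) -> float:
    """Step (v): fraction of test samples that are misclassified."""
    wrong = sum(1 for x, label in test if classify(x, params) != label)
    return wrong / len(test)


# Hypothetical one-dimensional training and test data for two classes.
train = {"z1": [1.0, 1.2, 0.8, 1.1], "z2": [3.0, 2.9, 3.2, 3.1]}
params = fit(train)
test_set = [(0.9, "z1"), (3.05, "z2"), (1.3, "z1")]
rate = error_rate(test_set, params)
```

With several features, the variance would be replaced by a covariance
matrix per class, as step (i) indicates.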

In the exploration of social networks that involves web pages, the
ties are acquired not just from the hit count but also from other
features. One such feature is the web snippet, which is not only
composed of a collection of words, some of which refer to person
names, but also contains the address of the web page, the URL. For
each search term $t_a$ that represents a name $a \in A$, the list of
snippets $\mathcal{S}$ will be obtained, where
$a \in S \subset \mathcal{S}$. Therefore, we can consider the results
returned by search engines for the name as a probability
$p(\omega \mid a)$. Furthermore, each list of snippets can be modeled
as a bag of words to generate a vector space in which the weights are
obtained by Equation (7), and this can be interpreted as the
conditional probabilities $P(w \mid a)$, $P(w \mid S)$, and
$P(w \mid \mathcal{S})$, which are meaningful as the current context
[36] of each person name.

We can provide a latent class $z$ by using the clusters of words from
the snippets. For every $a \in A$, $S_a$ is potentially a description
of the actors and the relations between the actors by using
Definitional Proposition [JJ and Equation OH), so that the current
context takes the form of trees as the optimal form of the
relationship between words and the person name $a \in A$. In this
case, once topics are provided, for example based on existing research
groups, the trees of words can be selected based on their proximity to
the topics by involving singletons and doubletons. The selected tree
of words will be a description of the latent variable $z$. Under this
scenario, we can derive the aspect model, in which the joint
probability over $D \times a \times w$ is expressed as a mixture

$$P(S, a, w) = P(S)\,P(a, w \mid S) \tag{8}$$

$$P(a, w \mid S) = \sum_{z \in Z} P(a, w \mid z)\,P(z \mid S) \tag{9}$$

The latent variables $z$ connect each of the instances that can be
obtained from the list of snippets, such as names, words and
documents, although every instance also has relations with the
others. Therefore, based on Theorem 1, the optimal form of this
relationship is one in which the instances are connected only by the
latent variables, disconnecting each direct relation between
instances, and a symmetric model can be parameterized by

$$P(S, a, w) = \sum_{z \in Z} P(z)\,P(S \mid z)\,P(w \mid z)\,P(a \mid z). \tag{10}$$
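
To make the symmetric parameterization concrete, the sketch below
evaluates the joint probability of a (snippet, actor, word) triplet as
a sum over latent topics. The two topics and all conditional
probability tables are invented toy values; fitting them, for example
with EM, is not shown.

```python
def joint_probability(s: str, a: str, w: str,
                      p_z: dict[str, float],
                      p_s_given_z: dict[str, dict[str, float]],
                      p_a_given_z: dict[str, dict[str, float]],
                      p_w_given_z: dict[str, dict[str, float]]) -> float:
    """P(S, a, w) = sum over z of P(z) P(S|z) P(w|z) P(a|z)."""
    return sum(
        p_z[z] * p_s_given_z[z][s] * p_w_given_z[z][w] * p_a_given_z[z][a]
        for z in p_z
    )


# Hypothetical model with two latent topics (toy values only).
p_z = {"z1": 0.5, "z2": 0.5}
p_s_given_z = {"z1": {"s1": 0.8, "s2": 0.2}, "z2": {"s1": 0.3, "s2": 0.7}}
p_a_given_z = {"z1": {"a1": 0.9, "a2": 0.1}, "z2": {"a1": 0.2, "a2": 0.8}}
p_w_given_z = {"z1": {"w1": 0.6, "w2": 0.4}, "z2": {"w1": 0.1, "w2": 0.9}}

p_s1_a1_w1 = joint_probability("s1", "a1", "w1",
                               p_z, p_s_given_z, p_a_given_z, p_w_given_z)

# Summing over every triplet should give 1 for a valid parameterization.
total = sum(
    joint_probability(s, a, w, p_z, p_s_given_z, p_a_given_z, p_w_given_z)
    for s in ("s1", "s2") for a in ("a1", "a2") for w in ("w1", "w2")
)
```

In this form snippets, actors, and words are conditionally independent
given the topic, which is what disconnecting the direct relations
between instances means in practice.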
5 Conclusion and Future Work 

We have proposed a novel framework for acquiring social networks from
snippets. We will demonstrate this framework after deriving the
Expectation-Maximization (EM) procedure for the model.